AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
Objectives:
- Predict whether a liability customer will buy a personal loan or not.
- Determine which variables are most significant.
- Determine which segment of customers should be targeted more.
Data Dictionary
Importing required libraries
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.linear_model import LogisticRegression
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
roc_auc_score,
precision_recall_curve,
roc_curve,
)
Import the dataset
#Read dataset using the pandas function read_csv
data = pd.read_csv("Loan_Modelling.csv")
df = data.copy()
Understand the Dataset
Let's start by performing basic steps to understand the data such as:
Review first and last few rows of the dataset
#Display first few rows of the dataset
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
The dataset loaded without any issues
#Display last few rows of the dataset
df.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.90 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.40 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.30 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.50 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Check the total number of rows and columns.
# checking the shape of the data using f-string
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 5000 rows and 14 columns.
Get the datatype information of the columns
#check data types of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
Checking for missing values in the data
df.isnull().sum()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
Checking for duplicate values in the dataset
df[df.duplicated()].count()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
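An equivalent, more direct check is to sum `DataFrame.duplicated()`, which flags every repeat of an earlier row. A minimal sketch on a hypothetical toy frame (not the bank data):

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row
toy = pd.DataFrame({"a": [1, 2, 2], "b": [3, 4, 4]})

# duplicated() marks each repeat of an earlier row; sum() counts them
n_dupes = toy.duplicated().sum()
print(n_dupes)  # 1
```

On the bank data this count is 0, consistent with the column-wise check above.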
Getting the statistical summary for the dataset
df.describe(include='all').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.00 | 2500.50 | 1443.52 | 1.00 | 1250.75 | 2500.50 | 3750.25 | 5000.00 |
| Age | 5000.00 | 45.34 | 11.46 | 23.00 | 35.00 | 45.00 | 55.00 | 67.00 |
| Experience | 5000.00 | 20.10 | 11.47 | -3.00 | 10.00 | 20.00 | 30.00 | 43.00 |
| Income | 5000.00 | 73.77 | 46.03 | 8.00 | 39.00 | 64.00 | 98.00 | 224.00 |
| ZIPCode | 5000.00 | 93169.26 | 1759.46 | 90005.00 | 91911.00 | 93437.00 | 94608.00 | 96651.00 |
| Family | 5000.00 | 2.40 | 1.15 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 |
| CCAvg | 5000.00 | 1.94 | 1.75 | 0.00 | 0.70 | 1.50 | 2.50 | 10.00 |
| Education | 5000.00 | 1.88 | 0.84 | 1.00 | 1.00 | 2.00 | 3.00 | 3.00 |
| Mortgage | 5000.00 | 56.50 | 101.71 | 0.00 | 0.00 | 0.00 | 101.00 | 635.00 |
| Personal_Loan | 5000.00 | 0.10 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Securities_Account | 5000.00 | 0.10 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| CD_Account | 5000.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Online | 5000.00 | 0.60 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| CreditCard | 5000.00 | 0.29 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
Get the count of unique values in each column.
df.nunique()
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64
# Drop the ID column, since it is a unique row identifier with no predictive value
df.drop(['ID'], axis=1, inplace=True)
df.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.60 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.50 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.00 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# Check unique value counts for ZIPCode, Family, Education, and the binary columns
cols = ['ZIPCode','Family','Education','Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard']
# Printing the count of unique categorical levels in each column
for column in cols:
print(df[column].value_counts(dropna=False))
print("-" * 50)
94720 169
94305 127
95616 116
90095 71
93106 57
...
96145 1
94087 1
91024 1
93077 1
94598 1
Name: ZIPCode, Length: 467, dtype: int64
--------------------------------------------------
1 1472
2 1296
4 1222
3 1010
Name: Family, dtype: int64
--------------------------------------------------
1 2096
3 1501
2 1403
Name: Education, dtype: int64
--------------------------------------------------
0 4520
1 480
Name: Personal_Loan, dtype: int64
--------------------------------------------------
0 4478
1 522
Name: Securities_Account, dtype: int64
--------------------------------------------------
0 4698
1 302
Name: CD_Account, dtype: int64
--------------------------------------------------
1 2984
0 2016
Name: Online, dtype: int64
--------------------------------------------------
0 3530
1 1470
Name: CreditCard, dtype: int64
--------------------------------------------------
Observations
- Age: Customer ages range from 23 to 67 years, with an average of 45 years.
- Experience: Work experience ranges from -3 to 43 years. The negative values look like data-entry errors, since experience cannot be negative.
- Income: Annual income ranges from $8K to $224K, with an average of $74K.
- ZIPCode: The most common ZIP codes are 94720, 94305, and 95616.
- Family: Family size ranges from 1 to 4 members, with an average of 2.
- CCAvg: Monthly credit card spending ranges from $0 to $10,000, with average spending of about $1,900.
- Education: Most customers hold an undergraduate degree, followed by Advanced/Professional and Graduate degrees.
- Mortgage: House mortgage value ranges from $0 to $635K, with an average of $56K (in thousands of dollars).
- Personal_Loan (target variable): Did the customer accept the personal loan offered in the last campaign? (1 = yes, 0 = no). 480 customers accepted the loan and 4,520 did not.
- Securities_Account: 522 customers have a securities account with the bank; 4,478 do not.
- CD_Account: 302 customers have a certificate of deposit (CD) account; 4,698 do not.
- Online: 2,984 customers use internet banking facilities.
- CreditCard: 1,470 customers use a credit card issued by another bank (excluding AllLife Bank).

Let's check the data distribution of the numerical columns.
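The negative Experience values noted above are almost certainly data-entry errors, and the notebook leaves them untreated at this point. One common remedy (an assumption on my part, not something the source applies) is to replace them with their absolute values; a sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical values standing in for the Experience column;
# the source notebook does not apply this treatment
toy = pd.DataFrame({"Experience": [-3, -1, 0, 10, 43]})

# Count the invalid (negative) entries, then replace them with absolute values
n_negative = (toy["Experience"] < 0).sum()
toy["Experience"] = toy["Experience"].abs()

print(n_negative)                # 2
print(toy["Experience"].min())   # 0
```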
#create a variable to get list of numerical columns
num_col=['Age','Experience','Income','CCAvg','Mortgage']
#Create a function to combine histogram and boxplot plotting
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
#Create both histogram and boxplot graphs for all numerical columns.
for i in num_col:
    histogram_boxplot(df, i)  # Call histogram_boxplot for each column
Observations
- Age: spread fairly evenly between 23 and 67; the mean (45.3) and median (45) nearly coincide, so the distribution is close to symmetric.
- Experience: similarly near-symmetric (mean 20.1, median 20), but the minimum of -3 confirms invalid negative entries.
- Income: right-skewed; the mean (73.8) exceeds the median (64), with a long tail out to 224.
- CCAvg: right-skewed; mean 1.94 versus median 1.5, with upper outliers up to 10.
- Mortgage: heavily right-skewed; at least half of customers have no mortgage (median 0), with outliers up to 635.
Observations on Categorical Attributes
# function to create barplots categorical variables
def cat_bar(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts(ascending=False).index
#order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
catcols = ['Family','Education','Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard']
#Create a barplot for each categorical column.
for j in catcols:
    cat_bar(df, j, perc=True)  # Call cat_bar for each column
#ZIPCode
plt.figure(figsize=(20, 5))
sns.countplot(x="ZIPCode", data=df, order=df.ZIPCode.value_counts().index[0:20]);
plt.xticks(rotation=90);
plt.show()
Observations on categorical columns
- Family: single-member households are the largest group (1,472, ~29%), followed by families of 2 (1,296), 4 (1,222), and 3 (1,010).
- Education: undergraduates are the largest group (2,096, ~42%), followed by Advanced/Professional (1,501) and Graduate (1,403).
- Personal_Loan: only 480 customers (9.6%) accepted a personal loan, so the classes are imbalanced.
- Securities_Account: 522 customers (~10%) hold a securities account.
- CD_Account: 302 customers (~6%) hold a CD account.
- Online: 2,984 customers (~60%) use online banking.
- CreditCard: 1,470 customers (~29%) use a credit card from another bank.
- ZIPCode: 94720 is the most common ZIP code, followed by 94305 and 95616.
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
sns.pairplot(data=df, hue="Personal_Loan")
plt.show()
cols = df[
[
"Age",
"Income",
"Experience",
"ZIPCode",
"Family",
"CCAvg",
"Education",
"Mortgage",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard"
]
].columns.tolist()
plt.figure(figsize=(12, 12))
for i, variable in enumerate(cols):
plt.subplot(5, 4, i + 1)
sns.boxplot(data=df, x="Personal_Loan", y=variable, palette="PuBu", showfliers=False)
plt.tight_layout()
plt.title(variable)
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
Personal_Loan vs. some of the categorical columns in our data
stacked_barplot(df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
"""
Treats outliers in a variable
df: dataframe
col: dataframe column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
Treat outliers in a list of variables
df: dataframe
col_list: list of dataframe columns
"""
for c in col_list:
df = treat_outliers(df, c)
return df
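The flooring and capping can be sanity-checked on a toy column. The sketch below repeats the same IQR computation that `treat_outliers` uses, on hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy column with one extreme value (hypothetical, for illustration only)
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100]})

Q1, Q3 = toy["x"].quantile(0.25), toy["x"].quantile(0.75)  # 2.0 and 4.0
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR  # -1.0 and 7.0

# Same flooring and capping as treat_outliers above
toy["x"] = np.clip(toy["x"], lower, upper)
print(toy["x"].tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

The extreme value 100 is capped at the upper whisker (7.0) while in-range values pass through unchanged.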
#Outlier-treated data will be stored in the df_trol dataframe.
numerical_col = ['Age','Experience','Income','CCAvg','Mortgage']
df_trol = treat_outliers_all(df, numerical_col)
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
plt.subplot(5, 4, i + 1)
plt.boxplot(df_trol[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Creating training and test sets.
df_dummies = pd.get_dummies(df_trol,drop_first=True)
df_dummies.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49.00 | 91107 | 4 | 1.60 | 1 | 0.00 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34.00 | 90089 | 3 | 1.50 | 1 | 0.00 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11.00 | 94720 | 1 | 1.00 | 1 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100.00 | 94112 | 1 | 2.70 | 2 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45.00 | 91330 | 4 | 1.00 | 2 | 0.00 | 0 | 0 | 0 | 0 | 1 |
X = df_dummies.drop(["Personal_Loan"], axis=1)
Y = df_dummies["Personal_Loan"]
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 12)
Shape of test set :  (1500, 12)
Percentage of classes in training set:
0    0.91
1    0.09
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.90
1    0.10
Name: Personal_Loan, dtype: float64
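Since only about 10% of customers accepted a loan, the class proportions in the two splits can drift apart under a purely random split. `train_test_split` accepts a `stratify` argument that keeps the proportions identical; a sketch on synthetic labels mirroring the Personal_Loan balance (the source split does not use stratification):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (~10% positives), standing in for Personal_Loan
rng = np.random.RandomState(1)
y = (rng.rand(5000) < 0.1).astype(int)
X = rng.rand(5000, 3)

# stratify=y forces both splits to carry the same class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # near-identical positive rates
```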
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
Function to compute different metrics, based on the threshold specified, to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# predicting using the independent variables
pred_prob = model.predict_proba(predictors)[:, 1]
pred_thres = pred_prob > threshold
pred = np.round(pred_thres)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix, based on the threshold specified, with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
pred_prob = model.predict_proba(predictors)[:, 1]
pred_thres = pred_prob > threshold
y_pred = np.round(pred_thres)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# There are different solvers available in Sklearn logistic regression
# The newton-cg solver is faster for high-dimensional data
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(X_train, y_train)
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_train, y_train)
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
lg, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.65 | 0.84 | 0.74 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_test, y_test)
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
lg, X_test, y_test
)
print("Test set performance:")
log_reg_model_test_perf
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.61 | 0.83 | 0.70 |
predict_proba: predicts the probabilities for classes 0 and 1.
Input: train or test data
Output: the predicted probabilities for classes 0 and 1
roc_auc_score: returns the AUC score.
Input:
1. True labels
2. Predicted probabilities for class 1
Output: AUC score between 0 and 1
roc_curve: returns the fpr, tpr and threshold values, given the true labels and the predicted probabilities for class 1.
Input:
1. True labels
2. Predicted probabilities for class 1
Output: false positive rate, true positive rate and threshold values
# Find the roc auc score for training data
logit_roc_auc_train = roc_auc_score(
y_train, lg.predict_proba(X_train)[:, 1]
) # The indexing represents predicted probabilities for class 1
# Find fpr, tpr and threshold values
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
# Plot roc curve
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Find the roc auc score for test data
logit_roc_auc_test = roc_auc_score(
y_test, lg.predict_proba(X_test)[:, 1]
) # The indexing represents predicted probabilities for class 1
# Find fpr, tpr and threshold values
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
# Plot roc curve
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
The optimal threshold is the value that best separates the true positive rate from the false positive rate.
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
# roc_curve returns the fpr, tpr and threshold values which takes the original data and predicted probabilities for the class 1.
fpr, tpr, thresholds = roc_curve(
y_train, lg.predict_proba(X_train)[:, 1]
) # The indexing represents predicted probabilities for class 1
optimal_idx = np.argmax(
tpr - fpr
) # Finds the index that contains the max difference between tpr and fpr
optimal_threshold_auc_roc = thresholds[
optimal_idx
] # stores the optimal threshold value
print(optimal_threshold_auc_roc)
0.1261949190272398
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.91 | 0.88 | 0.52 | 0.65 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.85 | 0.55 | 0.67 |
The Precision-Recall curve shows the tradeoff between precision and recall for different thresholds. It can be used to select an optimal threshold that improves the model's performance.
precision_recall_curve: returns the precision, recall and threshold values, given the true labels and the predicted probabilities for class 1.
Input:
1. True labels
2. Predicted probabilities for class 1
Output: precision, recall and threshold values
# Find the predicted probabilities for class 1
y_scores = lg.predict_proba(X_train)[:, 1]
# Find precision, recall and threshold values
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
# Plot recall precision curve
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
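Rather than reading the crossover point off the plot by eye, the threshold where precision and recall meet can be located programmatically; a sketch on synthetic labels and scores standing in for `y_train` and `lg.predict_proba(X_train)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels (~10% positives) and scores, for illustration only
rng = np.random.RandomState(1)
y = (rng.rand(1000) < 0.1).astype(int)
scores = np.clip(0.1 + 0.5 * y + 0.2 * rng.randn(1000), 0, 1)

prec, rec, thr = precision_recall_curve(y, scores)

# thr has one fewer entry than prec/rec; pick the threshold where the
# precision and recall curves are closest
crossover = thr[np.argmin(np.abs(prec[:-1] - rec[:-1]))]
print(round(float(crossover), 3))
```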
# setting the threshold
optimal_threshold_curve = 0.38
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.73 | 0.76 | 0.74 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_curve
)
log_reg_model_test_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_curve
)
print("Test set performance:")
log_reg_model_test_perf_threshold_curve
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.67 | 0.76 | 0.71 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default threshold",
"Logistic Regression-ROC curve threshold",
"Logistic Regression-Precision-Recall curve threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression-default threshold | Logistic Regression-ROC curve threshold | Logistic Regression-Precision-Recall curve threshold | |
|---|---|---|---|
| Accuracy | 0.96 | 0.91 | 0.95 |
| Recall | 0.65 | 0.88 | 0.73 |
| Precision | 0.84 | 0.52 | 0.76 |
| F1 | 0.74 | 0.65 | 0.74 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default threshold",
"Logistic Regression-ROC curve threshold",
"Logistic Regression-Precision-Recall curve threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Logistic Regression-default threshold | Logistic Regression-ROC curve threshold | Logistic Regression-Precision-Recall curve threshold | |
|---|---|---|---|
| Accuracy | 0.95 | 0.92 | 0.95 |
| Recall | 0.61 | 0.85 | 0.67 |
| Precision | 0.83 | 0.55 | 0.76 |
| F1 | 0.70 | 0.67 | 0.71 |
If the frequency of class A is 10% and the frequency of class B is 90%, class B becomes the dominant class and the decision tree will be biased toward it.
In that case, we can pass a dictionary {0: 0.15, 1: 0.85} to the model to specify the weight of each class, so the decision tree gives more weight to class 1.
class_weight is a hyperparameter for the decision tree classifier.
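As an alternative to hand-picking {0: 0.15, 1: 0.85}, scikit-learn can derive weights from the class frequencies with `class_weight="balanced"`. A sketch on synthetic labels (the source notebook uses the explicit dictionary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels (~10% positives), standing in for y_train
rng = np.random.RandomState(1)
y = (rng.rand(1000) < 0.1).astype(int)
X = rng.rand(1000, 4)

# "balanced" weights each class by n_samples / (n_classes * class_count),
# so the minority class automatically gets the larger weight
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(w, 2))))

dtree_bal = DecisionTreeClassifier(class_weight="balanced", random_state=1)
dtree_bal.fit(X, y)
```

The hand-picked dictionary gives finer control over the cost trade-off, while "balanced" adapts automatically if the class mix changes.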
dtree = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn_with_threshold(dtree, X_train, y_train)
dtree_model_train_perf_Prepune = model_performance_classification_sklearn_with_threshold(
dtree, X_train, y_train
)
print("Training performance:")
dtree_model_train_perf_Prepune
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
confusion_matrix_sklearn_with_threshold(dtree, X_test, y_test)
dtree_model_test_perf_Prepune = model_performance_classification_sklearn_with_threshold(
dtree, X_test, y_test
)
print("Test performance:")
dtree_model_test_perf_Prepune
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.89 | 0.88 | 0.88 |
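The perfect training scores above suggest the unrestricted tree has memorized the training set. `GridSearchCV`, imported at the top but not used in this section, could pre-prune the tree; a hedged sketch on synthetic data, with a hypothetical parameter grid:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

# Synthetic data standing in for X_train / y_train
rng = np.random.RandomState(1)
X = rng.rand(500, 5)
y = (X[:, 0] + 0.1 * rng.randn(500) > 0.7).astype(int)

# Hypothetical pre-pruning grid; the values are illustrative, not from the source
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 25],
}

# Recall as the scorer, since missing a likely loan buyer is the costlier error
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1),
    param_grid,
    scoring=make_scorer(recall_score),
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```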
## creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
dtree,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dtree, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- ZIPCode <= 92308.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode > 92308.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |--- ZIPCode <= 91269.00
|   |   |   |   |   |   |   |--- ZIPCode <= 90974.00
|   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode > 90974.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- ZIPCode > 91269.00
|   |   |   |   |   |   |   |--- Income <= 52.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 52.50
|   |   |   |   |   |   |   |   |--- weights: [5.40, 0.00] class: 0
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- ZIPCode <= 95041.50
|   |   |   |   |   |   |--- Mortgage <= 152.00
|   |   |   |   |   |   |   |--- ZIPCode <= 91335.50
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode > 91335.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 3.05
|   |   |   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- Experience <= 20.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Experience > 20.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Education > 1.50
|   |   |   |   |   |   |   |   |   |   |--- Income <= 88.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- Income > 88.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.25] class: 1
|   |   |   |   |   |   |--- Mortgage > 152.00
|   |   |   |   |   |   |   |--- Experience <= 11.00
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- Experience > 11.00
|   |   |   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode > 95041.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- CCAvg <= 4.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|   |   |   |--- CCAvg > 4.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|--- Income > 98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 100.00
|   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |--- Age > 54.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |--- Income > 100.00
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- CreditCard > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Income > 103.50
|   |   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |--- Income > 104.50
|   |   |   |   |   |   |--- weights: [64.65, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- Income <= 108.50
|   |   |   |   |--- ZIPCode <= 90147.00
|   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- ZIPCode > 90147.00
|   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |--- Income > 108.50
|   |   |   |   |--- ZIPCode <= 90019.50
|   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- ZIPCode > 90019.50
|   |   |   |   |   |--- Age <= 26.00
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- Age > 26.00
|   |   |   |   |   |   |--- Income <= 113.50
|   |   |   |   |   |   |   |--- Income <= 112.00
|   |   |   |   |   |   |   |   |--- weights:
(remainder of the output is truncated in the source)
[0.00, 3.40] class: 1 | | | | | | | |--- Income > 112.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Income > 113.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.00, 16.15] class: 1 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- weights: [0.00, 25.50] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.80 | | | | |--- Income <= 106.50 | | | | | |--- weights: [5.40, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | | | | |--- Experience > 3.50 | | | | | | | |--- Family <= 3.50 | | | | | | | | |--- ZIPCode <= 93205.50 | | | | | | | | | |--- CCAvg <= 1.10 | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 1.10 | | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- ZIPCode > 93205.50 | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- Family > 3.50 | | | | | | | | |--- ZIPCode <= 94887.00 | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | |--- Income <= 111.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- Income > 111.00 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- ZIPCode > 94887.00 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | |--- CCAvg > 2.80 | | | | |--- ZIPCode <= 90389.50 | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | |--- 
ZIPCode > 90389.50 | | | | | |--- Age <= 63.50 | | | | | | |--- Family <= 1.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | |--- Family > 1.50 | | | | | | | |--- Income <= 99.50 | | | | | | | | |--- Mortgage <= 250.75 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- Mortgage > 250.75 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Income > 99.50 | | | | | | | | |--- Experience <= 36.00 | | | | | | | | | |--- Family <= 2.50 | | | | | | | | | | |--- weights: [0.00, 4.25] class: 1 | | | | | | | | | |--- Family > 2.50 | | | | | | | | | | |--- weights: [0.00, 11.90] class: 1 | | | | | | | | |--- Experience > 36.00 | | | | | | | | | |--- Online <= 0.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- Online > 0.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- Age > 63.50 | | | | | | |--- weights: [0.30, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- ZIPCode <= 90017.50 | | | | |--- weights: [0.00, 0.85] class: 1 | | | |--- ZIPCode > 90017.50 | | | | |--- weights: [0.00, 187.85] class: 1
importances = dtree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Grid of parameters to choose from
parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
from sklearn.metrics import make_scorer, recall_score
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, criterion='entropy',
max_depth=5, min_impurity_decrease=0.01, random_state=1,
splitter='random')
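The same recall-oriented tuning pattern can be sketched end to end on synthetic data (a minimal, illustrative sketch using sklearn's `make_classification`; the dataset and parameter grid here are made up, not the bank's data):

```python
# Hypothetical sketch: recall-scored grid search for a decision tree,
# mirroring the tuning step above. Data and grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data, roughly like the ~9% loan-taker class
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85}),
    {"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]},
    scoring=make_scorer(recall_score),  # compare candidates by recall on class 1
    cv=5,
).fit(X, y)

print(grid.best_params_)
```

`make_scorer(recall_score)` is what makes the search optimize for catching positives (potential loan takers) rather than raw accuracy.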
confusion_matrix_sklearn_with_threshold(estimator, X_train, y_train)
dtree_model_train_perf_hyp = model_performance_classification_sklearn_with_threshold(
    estimator, X_train, y_train
)
print("Training performance:")
dtree_model_train_perf_hyp
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.95 | 0.97 | 0.66 | 0.79 |
confusion_matrix_sklearn_with_threshold(estimator, X_test, y_test)
dtree_model_test_perf_hyp = model_performance_classification_sklearn_with_threshold(
    estimator, X_test, y_test
)
print("Test performance:")
dtree_model_test_perf_hyp
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.95 | 0.94 | 0.67 | 0.78 |
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 127.82
| |--- Income <= 49.22
| | |--- weights: [196.05, 0.00] class: 0
| |--- Income > 49.22
| | |--- Income <= 111.56
| | | |--- CCAvg <= 3.03
| | | | |--- CD_Account <= 0.41
| | | | | |--- weights: [187.20, 4.25] class: 0
| | | | |--- CD_Account > 0.41
| | | | | |--- weights: [6.45, 3.40] class: 0
| | | |--- CCAvg > 3.03
| | | | |--- weights: [20.10, 34.85] class: 1
| | |--- Income > 111.56
| | | |--- Family <= 2.16
| | | | |--- Education <= 1.45
| | | | | |--- weights: [14.85, 0.00] class: 0
| | | | |--- Education > 1.45
| | | | | |--- weights: [2.70, 16.15] class: 1
| | | |--- Family > 2.16
| | | | |--- weights: [2.25, 21.25] class: 1
|--- Income > 127.82
| |--- Education <= 2.38
| | |--- Education <= 1.75
| | | |--- Family <= 3.72
| | | | |--- Family <= 2.87
| | | | | |--- weights: [45.75, 0.00] class: 0
| | | | |--- Family > 2.87
| | | | | |--- weights: [0.00, 24.65] class: 1
| | | |--- Family > 3.72
| | | | |--- weights: [0.00, 11.05] class: 1
| | |--- Education > 1.75
| | | |--- weights: [0.00, 79.05] class: 1
| |--- Education > 2.38
| | |--- weights: [0.00, 86.70] class: 1
Observations from the tree:
Using the extracted decision rules above, we can interpret the tuned model. For example:
- Customers with income above ~128 (thousand dollars) and an education level above 1 (graduate or advanced/professional) are almost all predicted to take a personal loan.
- Among customers with income below ~112, those with average monthly credit card spending (CCAvg) above ~3 (thousand dollars) lean toward taking a loan, while lower spenders almost never do.
- Customers with income above ~112 and a family size of 3 or more are predicted as likely loan takers.
Interpretations from the other decision rules can be made similarly.
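A rule like "Income > 127.82 and Education > 1.75 → likely loan taker" can be turned directly into a pandas filter to pull out the customer segment the marketing team should target. A minimal sketch (the column names follow the data dictionary; the rows are invented for illustration):

```python
import pandas as pd

# Toy customer rows (invented values; columns follow the data dictionary)
customers = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Income": [45, 130, 150, 120],
        "Education": [1, 2, 3, 3],
    }
)

# Decision rule read off the tuned tree: Income > 127.82 and Education > 1.75
target_segment = customers[
    (customers["Income"] > 127.82) & (customers["Education"] > 1.75)
]
print(target_segment["ID"].tolist())  # -> [2, 3]
```

This is how extracted rules translate into an actionable campaign list rather than staying a model artifact.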
# Importance of features in the tree building. (The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; also known as the Gini importance.)
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
# Here we can see that the importance of some features has increased
                     Imp
Income              0.54
Education           0.19
Family              0.15
CCAvg               0.11
CD_Account          0.01
Age                 0.00
Experience          0.00
ZIPCode             0.00
Mortgage            0.00
Securities_Account  0.00
Online              0.00
CreditCard          0.00
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
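The monotone effect described above, larger ccp_alpha means more pruning, can be verified directly on synthetic data (an illustrative sketch; the dataset and alpha values are made up):

```python
# Sketch: node count shrinks monotonically as ccp_alpha grows.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)

node_counts = []
for alpha in [0.0, 0.005, 0.02, 0.1]:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    node_counts.append(clf.tree_.node_count)  # total nodes in the fitted tree

print(node_counts)  # counts never increase as alpha grows
```

Because the pruned subtrees are nested, the node counts form a non-increasing sequence; this is what makes sweeping the `cost_complexity_pruning_path` alphas a well-behaved model-size dial.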
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00 | -0.00 |
| 1 | 0.00 | -0.00 |
| 2 | 0.00 | -0.00 |
| 3 | 0.00 | -0.00 |
| 4 | 0.00 | -0.00 |
| 5 | 0.00 | -0.00 |
| 6 | 0.00 | -0.00 |
| 7 | 0.00 | -0.00 |
| 8 | 0.00 | -0.00 |
| 9 | 0.00 | -0.00 |
| 10 | 0.00 | -0.00 |
| 11 | 0.00 | -0.00 |
| 12 | 0.00 | 0.00 |
| 13 | 0.00 | 0.00 |
| 14 | 0.00 | 0.00 |
| 15 | 0.00 | 0.00 |
| 16 | 0.00 | 0.00 |
| 17 | 0.00 | 0.00 |
| 18 | 0.00 | 0.00 |
| 19 | 0.00 | 0.00 |
| 20 | 0.00 | 0.00 |
| 21 | 0.00 | 0.00 |
| 22 | 0.00 | 0.01 |
| 23 | 0.00 | 0.01 |
| 24 | 0.00 | 0.01 |
| 25 | 0.00 | 0.01 |
| 26 | 0.00 | 0.01 |
| 27 | 0.00 | 0.01 |
| 28 | 0.00 | 0.01 |
| 29 | 0.00 | 0.01 |
| 30 | 0.00 | 0.01 |
| 31 | 0.00 | 0.02 |
| 32 | 0.00 | 0.02 |
| 33 | 0.00 | 0.02 |
| 34 | 0.00 | 0.02 |
| 35 | 0.00 | 0.02 |
| 36 | 0.00 | 0.03 |
| 37 | 0.00 | 0.03 |
| 38 | 0.00 | 0.03 |
| 39 | 0.00 | 0.03 |
| 40 | 0.00 | 0.03 |
| 41 | 0.00 | 0.03 |
| 42 | 0.00 | 0.04 |
| 43 | 0.00 | 0.04 |
| 44 | 0.00 | 0.04 |
| 45 | 0.01 | 0.05 |
| 46 | 0.01 | 0.06 |
| 47 | 0.01 | 0.07 |
| 48 | 0.02 | 0.09 |
| 49 | 0.06 | 0.21 |
| 50 | 0.25 | 0.47 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.25379571489480884
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Recall is maximized at alpha ≈ 0.009, but the corresponding tree is already heavily pruned; pruning much further would reduce the tree to a single root node and we would lose the business rules. Choosing a small alpha around 0.01 retains an interpretable set of rules while still achieving high recall.
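The selection logic used in the next cell, take the pruned tree that maximizes test recall, can be sketched standalone on synthetic data (illustrative; names and data are not the bank's):

```python
# Sketch: pick ccp_alpha by test-set recall over the pruning path.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_tr, y_tr)
    # clip guards against tiny negative alphas from floating-point error;
    # drop the last alpha (it prunes down to a single-node tree)
    for a in np.clip(path.ccp_alphas[:-1], 0, None)
]
recalls = [recall_score(y_te, t.predict(X_te)) for t in trees]
best = trees[int(np.argmax(recalls))]
print(best.ccp_alpha, max(recalls))
```

In practice one would also inspect the tree size at the chosen alpha, exactly the trade-off discussed above between peak recall and retaining readable business rules.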
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.009008434301508094,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.009008434301508094,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn_with_threshold(best_model, X_train, y_train)
dtree_model_train_perf = model_performance_classification_sklearn_with_threshold(
    best_model, X_train, y_train
)
print("Training performance:")
dtree_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.94 | 0.99 | 0.59 | 0.74 |
confusion_matrix_sklearn_with_threshold(best_model, X_test, y_test)
dtree_model_test_perf = model_performance_classification_sklearn_with_threshold(
    best_model, X_test, y_test
)
print("Test performance:")
dtree_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.94 | 0.99 | 0.62 | 0.76 |
plt.figure(figsize=(5, 5))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
best_model2 = DecisionTreeClassifier(
    ccp_alpha=0.01, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.01, class_weight={0: 0.15, 1: 0.85},
random_state=1)
confusion_matrix_sklearn_with_threshold(best_model2, X_train, y_train)
dtree_model_train_perf_postprune = model_performance_classification_sklearn_with_threshold(
    best_model2, X_train, y_train
)
print("Training performance:")
dtree_model_train_perf_postprune
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.94 | 0.99 | 0.59 | 0.74 |
confusion_matrix_sklearn_with_threshold(best_model2, X_test, y_test)
dtree_model_test_perf_postprune = model_performance_classification_sklearn_with_threshold(
    best_model2, X_test, y_test
)
print("Test performance:")
dtree_model_test_perf_postprune
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.94 | 0.99 | 0.62 | 0.76 |
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    best_model2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
| |--- CCAvg <= 2.95
| | |--- weights: [374.10, 0.00] class: 0
| |--- CCAvg > 2.95
| | |--- weights: [18.60, 18.70] class: 1
|--- Income > 98.50
| |--- Education <= 1.50
| | |--- Family <= 2.50
| | | |--- weights: [67.65, 2.55] class: 0
| | |--- Family > 2.50
| | | |--- weights: [1.65, 45.90] class: 1
| |--- Education > 1.50
| | |--- Income <= 116.50
| | | |--- weights: [13.35, 25.50] class: 1
| | |--- Income > 116.50
| | | |--- weights: [0.00, 188.70] class: 1
# Importance of features in the tree building. (The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; also known as the Gini importance.)
print(
    pd.DataFrame(
        best_model2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                     Imp
Income              0.65
Family              0.16
Education           0.14
CCAvg               0.06
Age                 0.00
Experience          0.00
ZIPCode             0.00
Mortgage            0.00
Securities_Account  0.00
CD_Account          0.00
Online              0.00
CreditCard          0.00
importances = best_model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
models_train_comp_df = pd.concat(
    [
        dtree_model_train_perf_Prepune.T,
        dtree_model_train_perf_hyp.T,
        dtree_model_train_perf_postprune.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree Model Preprune",
    "Decision Tree Model Hyper Parameter Tuning",
    "Decision Tree Model Postprune",
]
print("Train set performance comparison:")
models_train_comp_df
Train set performance comparison:
| | Decision Tree Model Preprune | Decision Tree Model Hyper Parameter Tuning | Decision Tree Model Postprune |
|---|---|---|---|
| Accuracy | 1.00 | 0.95 | 0.94 |
| Recall | 1.00 | 0.97 | 0.99 |
| Precision | 1.00 | 0.66 | 0.59 |
| F1 | 1.00 | 0.79 | 0.74 |
models_test_comp_df = pd.concat(
    [
        dtree_model_test_perf_Prepune.T,
        dtree_model_test_perf_hyp.T,
        dtree_model_test_perf_postprune.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree Model Preprune",
    "Decision Tree Model Hyper Parameter Tuning",
    "Decision Tree Model Postprune",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree Model Preprune | Decision Tree Model Hyper Parameter Tuning | Decision Tree Model Postprune |
|---|---|---|---|
| Accuracy | 0.98 | 0.95 | 0.94 |
| Recall | 0.89 | 0.94 | 0.99 |
| Precision | 0.88 | 0.67 | 0.62 |
| F1 | 0.88 | 0.78 | 0.76 |